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Abstract 

Automatic text tagging is an important 
component in higher level analysis of text 
^^ corpora, and its output can be used in 

Q^ many natural language processing applica- 

^\ tions. In languages like Turkish or Finnish, 

^-H with agglutinative morphology, morpholog- 

,— I ical disambiguation is a very crucial pro- 

3 cess in tagging, as the structures of many 

' lexical forms are morphologically ambigu- 

ON ous. This paper describes a POS tagger for 

Cn Turkish text based on a full-scale two-level 

specification of Turkish morphology that is 
\^ based on a lexicon of about 24,000 root 

CnI words. This is augmented with a multi- 

^^ word and idiomatic construct recognizer, 

r and most importantly morphological dis- 

^P ambiguator based on local neighborhood 

^ constraints, heuristics and limited amount 

^i^ of statistical information. The tagger also 

bJ[) has functionality for statistics compilation 

' I ' and fine tuning of the morphological an- 

O I alyzer, such as logging erroneous morpho- 

Ch logical parses, commonly used roots, etc. 

fj Preliminary results indicate that the tag- 

ger can tag about 98-99% of the texts ac- 
curately with very minimal user interven- 
tion. Furthermore for sentences morpho- 
logically disambiguated with the tagger, an 
LFG parser developed for Turkish, gener- 
ates, on the average, 50% less ambiguous 
parses and parses almost 2.5 times faster. 
The tagging functionality is not specific to 
Turkish, and can be applied to any lan- 
guage with a proper morphological analysis 
interface. 

1 Introduction 

As a part of large scale project on natural language 
processing for Turkish, we have undertaken the de- 
velopment of a number of tools for analyzing Turk- 
ish text. This paper describes one such tool - a text 
tagger for Turkish. The tagger is based on a full 



scale two-level morphological specification of Turk- 
ish (Ofiazer, 1993), implemented on the PC-KIMMO 
environment (Antworth, 1990). In this paper, we de- 
scribe the functionality and the performance of our 
tagger along with various techniques that we have 
employed to deal with various sources of ambigui- 
ties. 



2 Tagging Text 

Automatic text tagging is an important step in dis- 
covering the linguistic structure of large text cor- 
pora. Basic tagging involves annotating the words 
in a given text with various pieces of information, 
such as part-of-speech and other lexical features. 
Part-of-speech tagging facilitates higher-level analy- 
sis, such as parsing, essentially by performing a cer- 
tain amount of ambiguity resolution using relatively 
cheaper methods. 

The most important functionality of a tagger is 
the resolution of the structure and parts-of-speech of 
the lexical items in the text. This, however, is not a 
very trivial task since many words are in general am- 
biguous in their part-of-speech for various reasons. 
In English, for example a word such as make can 
be verb or a noun. In Turkish, even though there 
are ambiguities of such sort, the agglutinative na- 
ture of the language usually helps resolution of such 
ambiguities due to morphotactical restrictions. On 
the other hand, this very nature introduces another 
kind of ambiguity, where a lexical form can be mor- 
phologically interpreted in many ways. For example, 
the word evm, can be broken down as:"'^ 

evin POS English 

1. N(ev)-F2SG-P0SS N (your) house 

2. N(ev)-FGEN N of the house 

3. N(evin) N wheat germ 

If, however, the local context is considered, it may 
be possible to resolve the ambiguity as in: 



^ Output of the morphological analyzer is edited for 
clarity. 



sen-Ill 

PN(you)+GEN 

your 



ev-in .. 

N(ev)+2SG-P0SS 
house 



evin kapi-si .. 

N(ev)+GEN N(door)+3SG-P0SS 

door of the house 

using genitive-possessive agreement constraints. 
As a more complex case we can give the following: 
alinini§ 

1 ADJ(al)+2SG-POSS+NtoV()+NARR+3SG^ 
(V) (it) was your red (one) 

2 ADJ(al)+GEN+NtoV()+NARR+3SG 
(V) (it) belongs to the red (one) 

3 N(alin)+NtoV()+NARR+3SG 
(V) (it) was a forehead 

4 V(al)+PASS+VtoAdj(mis) 
(ADJ) (a) taken (object) 

5 V(al)+PASS+NARR+3SG 
(V) (it) was taken 

6 V(alin)-|-VtoAdj(mis) 
(ADJ) (an) offended (person) 

7 V(alin)+NARR+3SG 
(V) (s/he) was offended 

It is in general rather hard to select one of these 
interpretations without doing substantial analysis of 
the local context, and even then one can not fully 
resolve such (usually semantic) ambiguities. 

An additional problem that can be off-loaded to 
the tagger is the recognition of multi-word or id- 
iomatic constructs. In Turkish, which abounds with 
such forms, such a recognizer can recognize these 
very productive multi-word constructs, like 

kos-a kos-a 

run-F0PT-F3SG run-F0PT-F3SG 

yap-ar yap-ma-z 

do-FA0R-F3SG do-FNEG-FA0R-F3SG 
where both components are verbal but the com- 
pound construct is a manner or temporal adverb. 
This relieves the parser from dealing with them at 
the syntactic level. Furthermore, it is also possible 
to recognize various proper nouns with this func- 
tionality. Such help from a tagging functionality 
would simplify the development of parsers for Turk- 
ish (Demir, 1993; GiingSrdii, 1993). 

Researchers have used a number of different ap- 
proaches for building text taggers. Karlsson (Karls- 
son, 1990) has used a rule-based approach where 
the central idea is to maximize the use of mor- 
phological information. Local constraints expressed 
as rules basically discard many alternative parses 
whenever possible. Brill (Brill, 1992) has designed 
a rule-based tagger for English. The tagger works 
by automatically recognizing rules and remedying 
its weaknesses, thereby incrementally improving its 
performance. More recently, there has been a rule- 



In Turkish, all adjectives can be used as nouns, hence 
with very minor differences adjectives have the same 
morphotactics as nouns. 



based approach implemented with finite-state ma- 
chines (Koskenniemi et al., 1992; Voutilainen and 
Tapanainen, 1993). 

A completely different approach to tagging uses 
statistical methods, (e.g., (Church, 1988; Cutting et 
al., 1993)). These systems essentially train a statis- 
tical model using a previously hand-tagged corpus 
and provide the capability of resolving ambiguity on 
the basis of most likely interpretation. The models 
that have been widely used assume that the part-of- 
speech of a word depends on the categories of the two 
preceding words. However, the applicability of such 
approaches to word-order free languages remains to 
be seen. 

2.1 An example 

We can describe the process of tagging by showing 
the analysis for the sentence: 

I§ten doner donmez evmitzm yakmmda bulunan 
derm golde yiizerek gev§emek en hiiyuk zevkmidt. 

(Relaxing by swmirmng the deep lake near our 
house, as soon as I return from work was my greatest 
pleasure.) 

which we assume has been processed by the morpho- 
logical analyzer with the following output: 

i§ten POS 

1. N(i§)-FABL N-F 

doner 

1. N(d6ner) N 

2. V(d6n)-FAOR-F3SG \+ 

3. V(d6n)-FVtoAdj(er) ADJ 
donmez 

1. V(d6n)-FNEG-FAOR-F3SG V+ 

2. V(d6n)-FVtoAdj(mez) ADJ 
eviniizin 

1. N(ev)-MPL-POSS-FGEN N-F 

yakininda 

1. ADJ(yakin)-F3SG-POSS-FLOC N-F 

2. ADJ(yakin)-F2SG-POSS-FLOC N 
bulunan 

1. V(bul)-FPASS-FVtoADJ(yan) ADJ 

2. V(bulun)-FVtoADJ(yan) ADJ-F 
derin 

1. N(deri)-F2SG-POSS N 

2. ADJ(derin) ADJ-F 

3. V(der)-MMP-F2PL V 

4. V(de)-FVtoADJ(er)-F2SG-POSS N 

5. V(de)-FVtoADJ(er)-FGEN N 
golde 

1. N(g61)-FLOC N-F 

yiizerek 
1. V(yiiz)-FVtoADV(yerek) ADV-F 

gev§eniek 
1. V(gev§e)-FVtoINF(mak) \+ 

en 

1. N(en) N 

2. ADV(en) ADV-F 
biiyiik 

1. ADJ(biiyiik) ADJ-F 

zevkinidi 
1. N(zevk)-MSG-POSS-F \+ 

NtoV()-FPAST-F3SG 



Although there are a number of choices for tags 
for the lexical items in the sentence, almost all ex- 
cept one set of choices give rise to ungrammatical or 
implausible sentence structures.^ There are number 
of points that are of interest here: 

• the construct doner donmez formed by two 
tensed verbs, is actually a temporal adverb 
meaning ,,, as soon as .. return(s), hence these 
two lexical items can be coalesced into a single 
lexical item and tagged as a temporal adverb. 

• The second person singular possessive interpre- 
tation of yahnmda is not possible since this 
word forms a simple compound noun phrase 
with the previous lexical item and the third per- 
son singular possessive morpheme functions as 
the compound marker, agreeing with the agree- 
ment of the previous genitive case-marked form. 

• The word derm (deep) is the modifier of a sim- 
ple compound noun derm gol (deep lake) hence 
the second choice can safely be selected. The 
verbal root in the third interpretation is very 
unlikely to be used in text, let alone in sec- 
ond person imperative form. The fourth and 
the fifth interpretations are not very plausible 
either. The first interpretation (meaning your 
skm) may be a possible choice but can be dis- 
carded in the middle of a longer compound noun 
phrase. 

• The word en preceding an adjective indicates 
a superlative construction and hence the noun 
reading can be discarded. 

3 The Tagging Tool 

The tagging tool that we have developed integrates 
the following functionality with a user interface, as 
shown in Figure 1, implemented under X- windows. 
It can be used interactively, though user interaction 
is very rare and (optionally) occurs only when the 
disambiguation can not be done by the tagger. 

1. Morphological analysis with error logging, 

2. Multi-word and idiomatic construct recogni- 
tion, 

3. Morphological disambiguation by using con- 
straints, heuristics and certain statistics, 

4. Root and lexical form statistics compilation. 

The second and the third functionalities are imple- 
mented by a rule-base subsystem which allows one 
to write rules of the following form: 

Ci:Ai; C2:A2; ... C„:A„. 

where each C,- is a set of constraints on a lexical form, 
and the corresponding A,- is an action to be executed 
on the set of parses associated with that lexical form, 
only when all the conditions are satisfied. 



The conditions refer to any available morpholog- 
ical or positional feature associated with a lexical 
form such as: 

• Absolute or relative lexical position (e.g., sen- 
tence initial or final, or 1 after the current word, 
etc.) 

• root and final POS category, 

• derivation type, 

• case, agreement (number and person), and cer- 
tain semantic markers, for nominal forms, 

• aspect and tense, subcategorization require- 
ments, verbal voice, modality,and sense for ver- 
bal forms 

• subcategorization requirements for postposi- 
tions. 

Conditions may refer to absolute feature values or 
variables (as in Prolog, denoted by the prefix _ in the 
following examples) which are then used to link con- 
ditions. All occurrences of a variable have to unify 
for the match to be considered successful. This fea- 
ture is powerful and and lets us specify in a rather 
general way, (possibly long distance) feature con- 
straints in complex NPs, PPs and VPs. This is a 
part of our approach that distinguishes it from other 
constraint-based approaches. 

The actions are of the following types: 

• Null action: Nothing is done on the matching 
parse. 

• Delete: Removes the matching parse if more 
than one parse for the lexical form are still in 
the set associated with the lexical form. 

• Output: Removes all but the matching parse 
from the set effectively tagging the lexical form 
with the matching parse. 

• Compose: Composes a new parse from various 
matching parses, for multi-word constructs. 

These rules are ordered, and applied in the given 
order and actions licensed by any matching rule are 
applied. One rule formalism is used to encode both 
multi-word constructs and constraints. 

3.1 The Multi-word Construct Processor 

As mentioned before, tagging text on lexical item ba- 
sis may generate spurious or incorrect results when 
multiple lexical items act as single syntactic or se- 
mantic entity. For example, in the sentence §irin mi 
§irin bir kopek ko§a ko§a geldi (A very cute dog came 
running) the fragment §irin mi §irin constitutes a 
duplicated emphatic adjective in which there is an 
embedded question suffix mi (written separately in 
Turkish),'* and the fragment ko§a ko§a is a dupli- 
cated verbal construction, which has the grammat- 
ical role of manner adverb in the sentence, though 



The correct choices of tags are marked with -|-. 



*If, however, the adjective §irin was not repeated, 
then we would have a question formation. 
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Figure 1: User interface of tagging tool 



both of the constituent forms are verbal construc- 
tions. The purpose of the multi-word construct pro- 
cessor is to detect and tag such productive con- 
structs in addition to various other semantically co- 
alesced forms such as proper nouns, etc. 

The following is a set of multi-word constructs for 
Turkish that we handle in our tagger. This list is 
not meant to be comprehensive, and new construct 
specifications can easily be added. It is conceivable 
that such a functionality can be used in almost any 
language. 

1. duplicated optative and 3SG verbal forms func- 
tioning as manner adverb, e.g., ko§a ko§a, aorist 
verbal forms with root duplications and sense 
negation functioning as temporal adverbs, e.g., 
yapar yapmaz, and duplicated verbal and de- 
rived adverbial forms with the same verbal root 
acting as temporal adverbs, e.g., gitti gtdelt, 

2. duplicated compound nominal form construc- 
tions that act as adjectives, e.g., giizeller giizelt, 
and emphatic adjectival forms involving the 
question suffix, e.g., giizel rrn giizel, 

3. adjective or noun duplications that act as man- 
ner adverbs, e.g., htzh htzh, ev ev, 

4. idiomatic word sequences with specific usage 
whose semantics is not compositional, e.g., yam 
stra, htg olmazsa, and idiomatic forms which are 
never used singularly, e.g., giirul giirul, 

5. proper nouns, e.g., Jimmy Carter, Topkapi 



Sarayt (Topkapi Palace). 

6. compound verb formations which are formed by 
a lexically adjacent, direct or oblique object and 
a verb, which for the purposes of syntactic anal- 
ysis, may be considered as single lexical item. 

We can give the following example for specifying 
a multi-word construct:® 

Lex=_Hl, Root=_Rl, Cat=V, Aspect=AOR, Agr=3SG, 

Sense=POS: ; 
Lex=_H2, Root=_Rl, Cat=V, Aspect=AOR, Agr=3SG, 

Sense = NEG : 
Compose=((*CAT* ADV) (*R* "_H1 _H2 (_R1)") 

(*SUB* TEMP)) . 

This rule would match any adjacent verbal lexical 
forms with the same root, both with the aorist as- 
pect, and 3SG agreement. The first verb has to be 
positive and the second one negated. When found, 
a composite lexical form with an temporal adverb 
part-of-speech, is then generated. The original ver- 
bal root may be recovered from the root of the com- 
posed form for any subcategorization checks, at the 
syntactic level. 

3.2 Using constraints for morphological 
ambiguity resolution 

Morphological analysis does not have access to syn- 
tactic context, so when the morphological structure 



^The output of the morphological analyzer is actually 
a, feature-valueXisi in the standard LISP format. 



of a lexical form has several distinct analyses, it 
is not possible to disambiguate such cases except 
maybe by using root usage frequencies. For disam- 
biguation one may have to use information provided 
by sentential position and the local morphosyntac- 
tic context. Voutilainen and Heikkila (Voutilainen et 
al., 1992) have proposed a constraint grammar ap- 
proach where one specifies constraints on the local 
context of a word to disambiguate among multiple 
readings of a word. Their approach has, however, 
been applied to English where morphological infor- 
mation has rather little use in such resolution. 

In our tagger, constraints are applied on each 
word, and check if the forms within a specified neigh- 
borhood of the word satisfy certain morphosyntactic 
or positional restrictions, and/or agreements. Our 
constraint pattern specification is very similar to 
multi-word construct specification. Use of variables, 
operators and actions, are same except that the com- 
pose actions does not make sense here. The follow- 
ing is an example constraint that is used to select 
the postpositional reading of certain word when it is 
preceded by a yet unresolved nominal form with a 
certain case. The only requirement is that the case 
of the nominal form agrees with the case sub catego- 
rization requirement of the following postposition. 
(LP = refers to current word, LP = 1 refers to 
next word.) 

LP = 0, Case = _C : Output; 

LP = 1, Cat = POSTP, Subcat = _C : Output. 

When a match is found, the matching parses from 
both words are selected and the others are discarded. 
This one constraint disambiguates almost all of the 
postpositions and their arguments, the exceptions 
being nominal words which semantically convey the 
information provided by the case (such as words in- 
dicating direction, which may be used as if they have 
a dative case). 

Finally the following example constraint deletes 
the sentence final adjectival readings derived from 
verbs, effectively preferring the verbal reading (as 
Turkish is a SOV language.) 

Cat = V, Finalcat = AD J, SP = BID : Delete. 

4 Performance of the Tagger 

We have performed some preliminary experiments 
to assess the effectiveness of our tagger. We have 
used about 250 constraints for Turkish. Some of 
these constraints are very general as the postposition 
rule above, while some are geared towards recogni- 
tion of NP's of various sorts and a small number ap- 
ply certain syntactic heuristics. In this section, we 
summarize our preliminary results. Table 1 presents 
some preliminary results about the our tagging ex- 
periments. 

Although the texts that we have experimented 
with are rather small, the results indicate that our 



approach is effective in disambiguating morpholog- 
ical structures, and hence POS, with minimal user 
intervention. Currently, the speed of the tagger is 
limited by essentially that of the morphological ana- 
lyzer, but we have ported the morphological analyzer 
to the XEROX TWOL system developed by Kart- 
tunen and Beesley (Karttunen and Beesley, 1992). 
This system can analyze Turkish word forms at 
about 1000 forms/sec on SparcStation lO's. We in- 
tend to integrate this to our tagger soon, improving 
its speed performance considerably. 

We have tested the impact of morphological dis- 
ambiguation on the performance of a LEG parser 
developed for Turkish (Giingordii, 1993; Giingordii 
and Ofiazer, 1994). The input to the parser was dis- 
ambiguated using the tool developed and the results 
were compared to the case when the parser had to 
consider all possible morphological ambiguities it- 
self. Eor a set of 80 sentences considered, it can be 
seen that (Table 2), morphological disambiguation 
enables almost a factor of two reduction in the av- 
erage number of parses generated and over a factor 
of two speed-up in time. 

5 Conclusions 

This paper has presented an overview of a tool for 
tagging text along with various issues that have 
come up in disambiguating morphological parses of 
Turkish words. We have noted that the use of con- 
straints is very effective in morphological disam- 
biguation. Preliminary results indicate that the tag- 
ger can tag about 98-99% of the texts accurately 
with very minimal user intervention, though it is 
conceivable that it may do worse on more substantial 
text - but there is certainly room for improvement in 
the mechanisms provided. The tool also provides for 
recognition of multi-word constructs that behave as 
a single syntactic and semantic entity in higher level 
analysis, and the compilation of information for fine- 
tuning of the morphological analyzer and the tagger 
itself. We, however, feel that our approach does not 
deal satisfactorily with most aspects of word-order 
freeness. We are currently working on an extension 
whereby the rules do not apply immediately but vote 
on their preferences and a final global vote tally de- 
termines the assignments. 
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Table 2: Impact of disambiguation on parsing performance 





No disambiguation 


With disambiguation 


Ratios 1 


Avg. Length 
(words) 


Avg. 
parses 


Avg. 
time (sec) 


Avg. 
parses 


Avg. 
time (sec) 


parses 


speed-up 


5.7 


5.78 


29.11 


3.30 


11.91 


1.97 


2.38 



Note: The ratios are the averages of the sentence by sentence ratios. 



E. Brill. 1992. A simple rule-based part-of-speech 
tagger. In Proceedings of the Third Conference on 
Applied Computational Linguistics, Trento, Italy. 

K. W. Church. 1988. A stochastic parts program 
and noun phrase parser for unrestricted text. In 
Proceedings of the Second Conference on Applied 
Natural Language Processing (ACL), pages 136- 
143. 

D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. 
1993. A practical part-of-speech tagger. Technical 
report, Xerox Palo Alto Research Center. 

C. Demir. 1993. An ATN grammar for Turkish. 
Master's thesis. Department of Computer Engi- 
neering and Information Sciences, Bilkent Univer- 
sity, Ankara, Turkey, July. 

Z. Giingordii and K. Oflazer. 1994. Parsing Turkish 
using the Lexical-Functional Grammar formalism. 
In Proceedings of COLLNG-94, the 15th Interna- 
tional Conference on Computational Linguistics, 
Kyoto, Japan. 

Z. Giingordii. 1993. A Lexical-Functional Gram- 
mar for Turkish. Master's thesis. Department of 
Computer Engineering and Information Sciences, 
Bilkent University, Ankara, Turkey, July. 

F. Karlsson. 1990. Constraint grammar as a frame- 
work for parsing running text. In Proceedings of 
COLLNG-90, the 13th International Conference 
on Computational Linguistics, volume 3, pages 
168-173, Helsinki, Finland. 



L. Karttunen and K. R. Beesley. 1992. Two-level 
rule compiler. Technical Report, XEROX Palo 
Alto Research Center. 

K. Koskenniemi, P. Tapanainen, and A. Voutilainen. 
1992. Compiling and using finite-state syntactic 
rules. In Proceedings of COLIN G-92, the Uth 
International Conference on Computational Lin- 
guistics, volume 1, pages 156-162, Nantes, France. 

K. Ofiazer. 1993. Two-level description of Turkish 
morphology. In Proceedings of the Sixth Confer- 
ence of the European Chapter of the Association 
for Computational Linguistics, April. A full ver- 
sion appears in Literary and Linguistic Comput- 
ing, Vol.9 No. 2, 1994. 

A. Voutilainen and P. Tapanainen. 1993. Ambiguity 
resolution in a reductionistic parser. In Proceed- 
ings of EACL'93, Utrecht, Holland. 

A. Voutilainen, J. Heikkila, and A. Anttila. 1992. 
Constraint Grammar of English. University of 
Helsinki. 



